Abstract

Linear models with a growing number of parameters are widely used
in modern statistics. An important problem for this kind of model is
variable selection. Bayesian approaches, which provide a stochastic
search over informative variables, have gained popularity. In this paper, we
study the asymptotic properties of Bayesian model selection when the
model dimension p grows with the sample size n. We consider $p \le n$
and provide sufficient conditions under which: (1) with large probability,
the posterior probability of the true model (from which samples are drawn)
uniformly dominates the posterior probability of any incorrect model; and
(2) the posterior probability of the true model converges to one in probability.
Both (1) and (2) guarantee that the true model will be selected under a
Bayesian framework. We also demonstrate several situations in which (1) holds
but (2) fails, which illustrates the difference between these two properties.
Finally, we generalize our results to include g-priors, and provide simulation
examples to illustrate the main results.



Keywords: Bayesian model selection; growing number of parameters; Posterior model 
1. Introduction

This work was motivated by efforts to analyze remotely sensed (satellite) 
data which consists of multiple spatial images. In the setting of interest, one 
image corresponds to a "response" while others correspond to covariates. 
To find the relationship between the response and covariate spatial images, 
Zhang et al. (2010) proposed a functional concurrent linear model with varying
coefficients and applied a wavelet approach to transform this model into
a linear model (with a particular design matrix) which contains an n-vector
of responses and a sparse p-vector of wavelet coefficients. Since the images
contain thousands of pixels, the model dimension p, which is determined by
the maximum decomposition level in the wavelet expansion, has to be large
so that sufficiently fine details in the target images can be captured. On the
other hand, p has an upper bound $p \le (K + 1)n$, where K is the total number
of covariate images involved in the model. This is because each spatial image
corresponds to a vector of wavelet coefficients whose dimension does not
exceed n, and there are K + 1 images in total, one of them representing
the intercept and the others the slopes. An important question is how to select
the nonzero coefficients in the model, which is essentially a variable selection
problem. Zhang et al. (2010) adopted a Lasso approach to address this.

The problem they handle relies on a specific design matrix induced by the
wavelet structure. It is of interest to frame the variable selection problem
more broadly. More precisely, we suppose that data are drawn from the
linear model

$$ y = X\beta + \epsilon, \tag{1.1} $$

where $\epsilon \sim N(0, \sigma_0^2 I_n)$ is an n-vector of errors, $y = (y_1, \ldots, y_n)^T$ is an
n-vector of responses, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a p-vector of parameters and
$X = (X_1, \ldots, X_p)$ is an $n \times p$ design matrix with $X_j$ the jth column of X. It is
also assumed that only a subset of $X_1, \ldots, X_p$ contributes to y, and we are
interested in selecting the variables in this subset.

We consider a Bayesian variable selection (BVS) approach based on model
(1.1). The Bayesian model to be considered is a variation of that of George and
McCulloch (1993) and has been studied by Clyde et al. (1998), Clyde and
George (2000), and Wolfe et al. (2004). Clearly, each subset of $X_1, \ldots, X_p$
defines a candidate model, so there are $2^p$ of them in total. According to
George and McCulloch (1993), the marginal posterior probabilities of all
$2^p$ models can be calculated, and the model with the largest posterior
probability can be selected as the "best" model. This motivates the formal
definition of posterior model consistency (PMC). We say that PMC holds if
the true model, defined as the model from which samples are drawn, has a
posterior probability approaching one. Since the posterior probabilities
of all models sum to one, when PMC holds the posterior probability
of any incorrect model goes to zero as n goes to infinity, so that the true
model can be correctly selected.

PMC has been theoretically verified when p is fixed (see Fernández et
al., 2001; Moreno and Girón, 2005; Liang et al., 2008; Casella et al., 2009).
However, fewer results have been derived when p grows with n, an
interesting and important scenario. For increasing p, Berger et al. (2003),
Moreno et al. (2010) and Girón et al. (2010) proved consistency for Bayes
factors. Although PMC and consistency of Bayes factors are equivalent for
fixed p (see Liang et al., 2008; Casella et al., 2009), they are different for
growing p. In fact, we will see below that consistency of the Bayes factor is
equivalent to consistency of the posterior odds ratio under a general setting,
but that the latter form of consistency is weaker than PMC. Therefore, it
seems valuable to study PMC separately.

In this paper we consider two classes of design matrix X, both with
$p \le n$, although our results can be generalized to $p \gg n$ when combined
with certain dimension reduction approaches. In the first case, X is quite
general. A representative situation is that the eigenvalues of $X^TX/n$ are
uniformly bounded both above and below. Consistency is examined when p
grows slower than n, say, $p \log n = o(n)$. We find that the posterior odds in
favor of any incorrect model converge uniformly to zero, and the posterior
probability of the true model converges to one. The second case we consider
occurs when $X^TX/n$ is the identity matrix, i.e., $X^TX = nI_p$, and p grows
as fast as n, say $p = n$. In that case, we examine consistency of the posterior
odds ratio (the posterior odds ratio in favor of any incorrect model converges
uniformly to zero) and PMC (the posterior probability of the true model
converges to one). We also demonstrate how consistency of the
posterior odds ratio can hold even though PMC fails. Finally, we generalize
our results to the g-prior setting first proposed by Zellner (1986).

The remainder of this paper is organized as follows. Section 2 provides
preliminaries and the main results. Section 3 generalizes the results to
g-prior settings. Section 4 presents a numerical example related to the
results of Section 2. Section 5 contains the conclusion. Technical arguments
are given in the Appendix.

2. Preliminaries and main results 

Suppose the n-dimensional response vector $y = (y_1, \ldots, y_n)^T$ and the n
by p covariate matrix $X = (X_1, \ldots, X_p)$ are linked by the model

$$ y = X\beta + \epsilon, \tag{2.1} $$

where the $X_j$'s are n-vectors, $\beta = (\beta_1, \ldots, \beta_p)^T$ is an unknown p-vector and $\epsilon$
is a vector of random errors. Here, X is allowed to be either (1) random but
independent of $\epsilon$ or (2) deterministic. For $1 \le j \le p$, define the state variable
of $\beta_j$ by $\gamma_j = I(\beta_j \ne 0)$ and $\gamma = (\gamma_1, \ldots, \gamma_p)^T$, where $I(\cdot)$ is the indicator
function. We call $\gamma$ the state vector of $\beta$ and denote the number of 1's in $\gamma$
by $|\gamma|$. The state vector $\gamma$ completely determines the inclusion or exclusion
of the $\beta_j$'s in model (2.1), and therefore can define a model $y = X_\gamma\beta_\gamma + \epsilon$, where
$X_\gamma$ is an $n \times |\gamma|$ submatrix of X whose columns are indexed by the nonzero
components of $\gamma$, and $\beta_\gamma$ is the subvector (with size $|\gamma|$) of $\beta$ indexed by the
nonzero components of $\gamma$. It is natural, therefore, to call each $\gamma$ a model.
Note that there are $2^p$ such $\gamma$'s representing $2^p$ different models. For any
state vectors $\gamma$ and $\gamma'$, let $(\gamma \backslash \gamma')_j = I(\gamma_j = 1, \gamma'_j = 0)$ denote the difference
(which is also a state vector) between $\gamma$ and $\gamma'$, i.e., the 0-1 vector indicating
the variables that are present in $\gamma$ but absent in $\gamma'$. We say that $\gamma$ is nested in
$\gamma'$ (denoted by $\gamma \subset \gamma'$) if $\gamma \backslash \gamma' = 0$. Denote the true model coefficient vector
by $\beta^0$ and the corresponding state vector by $\gamma^0$, and let $s_n = |\gamma^0|$ denote the
size of the true model.

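In this paper we consider the following hierarchical Bayesian model, which
is a variation of the model used by George and McCulloch (1993):

$$
\begin{aligned}
y \mid \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n), \\
\beta_j \mid \gamma_j, \sigma^2 &\sim (1 - \gamma_j)\delta_0 + \gamma_j N(0, c_j\sigma^2), \\
1/\sigma^2 &\sim \chi^2_\nu, \\
\gamma &\sim p(\gamma),
\end{aligned}
\tag{2.2}
$$

where $\delta_0$ is the point mass measure concentrated at zero. Hereafter, $\nu$ will be
fixed a priori. Let $\Sigma = \mathrm{diag}(c)$ with $c = (c_j)_{1 \le j \le p}$ a vector of positive
components, and let $\Sigma_\gamma$ be the $|\gamma| \times |\gamma|$ diagonal submatrix of $\Sigma$ corresponding
to $\gamma$. Let $Z = (y, X)$ denote the full data set. It follows by integrating out
$\beta$ and $\sigma^2$ that the posterior distribution of $\gamma$ is given by

$$ p(\gamma \mid Z) \propto p(\gamma)\,\det(W_\gamma)^{-1/2}\left\{1 + y^T\left(I_n - X_\gamma U_\gamma^{-1} X_\gamma^T\right)y\right\}^{-(n+\nu)/2}, \tag{2.3} $$

where $U_\gamma = \Sigma_\gamma^{-1} + X_\gamma^T X_\gamma$ and $W_\gamma = \Sigma_\gamma^{1/2} U_\gamma \Sigma_\gamma^{1/2}$. In particular, if $\gamma = \emptyset$ (the
null model containing no covariate variables), (2.3) still holds if we adopt the
conventions that $X_\emptyset = 0$ and $\Sigma_\emptyset = U_\emptyset = W_\emptyset = 1$.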
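For small p the posterior over all $2^p$ models can be enumerated directly. The following is a minimal sketch, not the authors' implementation: it assumes the posterior takes the form $p(\gamma \mid Z) \propto p(\gamma)\det(W_\gamma)^{-1/2}\{1 + y^T(I_n - X_\gamma U_\gamma^{-1}X_\gamma^T)y\}^{-(n+\nu)/2}$ with $U_\gamma = \Sigma_\gamma^{-1} + X_\gamma^TX_\gamma$ (which is what a $\chi^2_\nu$ prior on $1/\sigma^2$ yields), a flat prior on $\gamma$, and an illustrative default $\nu = 1$. The function names are hypothetical.

```python
import itertools

import numpy as np


def log_unnormalized_posterior(y, X, gamma, c, nu=1.0, log_prior=0.0):
    """log of p(gamma) det(W_g)^(-1/2) (1 + y'(I - X_g U_g^{-1} X_g')y)^(-(n+nu)/2)."""
    n = len(y)
    idx = np.flatnonzero(gamma)
    if idx.size == 0:
        # Null model: W = 1 by convention and the quadratic form is y'y.
        return log_prior - 0.5 * (n + nu) * np.log1p(y @ y)
    Xg = X[:, idx]
    cg = np.asarray(c, dtype=float)[idx]
    U = np.diag(1.0 / cg) + Xg.T @ Xg               # U_gamma
    Xty = Xg.T @ y
    quad = y @ y - Xty @ np.linalg.solve(U, Xty)    # y'(I - X_g U^{-1} X_g')y
    # det(W_gamma) = det(Sigma_gamma) * det(U_gamma)
    logdet_W = np.sum(np.log(cg)) + np.linalg.slogdet(U)[1]
    return log_prior - 0.5 * logdet_W - 0.5 * (n + nu) * np.log1p(quad)


def posterior_over_models(y, X, c, nu=1.0):
    """Enumerate all 2^p state vectors under a flat prior (small p only)."""
    p = X.shape[1]
    models = list(itertools.product((0, 1), repeat=p))
    logs = np.array(
        [log_unnormalized_posterior(y, X, np.array(g), c, nu) for g in models]
    )
    probs = np.exp(logs - logs.max())               # stabilise before normalising
    return models, probs / probs.sum()
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow, since the factor with exponent $-(n+\nu)/2$ becomes extremely small for poorly fitting models.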






Define $S_1 = \{\gamma \mid \gamma^0 \subset \gamma, \gamma \ne \gamma^0\}$ and $S_2 = \{\gamma \mid \gamma^0 \text{ is not nested in } \gamma\}$. It
is clear that $S(n)$ defined by $S(n) = S_1 \cup S_2 \cup \{\gamma^0\}$ is the class of all state
vectors. In particular, when $\gamma^0 = \emptyset$, $S_2$ is empty, and hence $S_1$ is the class of
all state vectors excluding $\gamma^0$. As was found by Liang et al. (2008), we will
see later in this section that whether $\gamma^0$ is null or nonnull results in some
differences in the main results (especially in the assumptions that are needed
to establish them); thus, we will treat these cases separately.
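The state-vector notation and the partition of the model space into the true model, strict supersets of it, and all remaining models can be sketched in a few lines. This is an illustration only; the function names and the labels "true", "S1" and "S2" are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

State = Tuple[int, ...]  # a 0-1 state vector of length p


def difference(g: State, g_prime: State) -> State:
    """1 in position j exactly when variable j is in g but not in g_prime."""
    return tuple(int(a == 1 and b == 0) for a, b in zip(g, g_prime))


def is_nested(g: State, g_prime: State) -> bool:
    """g is nested in g_prime when the difference of g and g_prime is zero."""
    return not any(difference(g, g_prime))


def classify(g: State, g0: State) -> str:
    """Label g as the true model, a strict superset of it (S1), or neither (S2)."""
    if g == g0:
        return "true"
    return "S1" if is_nested(g0, g) else "S2"


def partition_counts(g0: State) -> Dict[str, int]:
    """Count how many of the 2^p state vectors fall into each class."""
    counts = {"true": 0, "S1": 0, "S2": 0}
    for g in product((0, 1), repeat=len(g0)):
        counts[classify(g, g0)] += 1
    return counts
```

For a null true model every other state vector is a strict superset, so the second class is empty, in line with the remark above.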

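When $\gamma^0$ is nonnull, we denote

$$ \varphi_{\min}(n) = \min_{\gamma \in S_2} \lambda_-\left(\tfrac{1}{n} X^T_{\gamma^0 \backslash \gamma}(I_n - P_\gamma)X_{\gamma^0 \backslash \gamma}\right) \quad \text{and} \quad \varphi_{\max}(n) = \max_{\gamma \in S_2} \lambda_+\left(\tfrac{1}{n} X^T_{\gamma^0 \backslash \gamma} X_{\gamma^0 \backslash \gamma}\right), $$

where $P_\gamma = X_\gamma(X_\gamma^T X_\gamma)^{-1} X_\gamma^T$ is a
projection matrix, and $\lambda_-(A)$ and $\lambda_+(A)$ are the minimal and maximal eigenvalues
of the square matrix A. We also adopt the convention that $P_\emptyset = 0$. For
the case that $\gamma^0 = \emptyset$, both $\varphi_{\min}$ and $\varphi_{\max}$ are meaningless, and $S_1$ will be
focused on in this situation.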

Before proceeding further, we introduce several types of consistency central
to this work. Generally speaking, to make a correct model selection,

$$ \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0 \tag{2.4} $$

should hold as $n \to \infty$, which means that the posterior probability of the
true model asymptotically dominates that of any incorrect model. Following
a framework similar to that of Zellner (1978), the term $p(\gamma \mid Z)/p(\gamma^0 \mid Z)$, which
is called the posterior odds ratio in favor of $\gamma$, satisfies the relationship

$$ \frac{p(\gamma \mid Z)}{p(\gamma^0 \mid Z)} = BF(\gamma : \gamma^0)\,\frac{p(\gamma)}{p(\gamma^0)}, \tag{2.5} $$

where $BF(\gamma : \gamma^0) := p(Z \mid \gamma)/p(Z \mid \gamma^0)$ is the Bayes factor of $\gamma$ versus $\gamma^0$ and
$p(\gamma)/p(\gamma^0)$ is the prior odds ratio in favor of $\gamma$. The Bayes factor is consistent
if for any $\gamma \ne \gamma^0$, $BF(\gamma : \gamma^0) \to 0$. The posterior odds ratio is consistent if
for any $\gamma \ne \gamma^0$, $p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0$. It is easy to see that property (2.4)
implies consistency of the posterior odds ratio. We say that posterior model
consistency (PMC) holds if $p(\gamma^0 \mid Z) \to 1$. These types of consistency have all
been useful in Bayesian model selection. Representative references include
(1) assessment of the posterior odds ratio: Jeffreys (1967), Zellner (1971, 1978);
(2) performance of the Bayes factor: Berger and Pericchi (1996), Moreno et al.
(1998, 2010), Casella et al. (2009); (3) PMC: Fernández et al. (2001), Liang
et al. (2008).
It is easy to see that when

$$ c^{-1} \le \min_{\gamma} p(\gamma)/p(\gamma^0) \le \max_{\gamma} p(\gamma)/p(\gamma^0) \le c \tag{2.6} $$

holds for some positive constant c, consistency of the Bayes factor is
equivalent to consistency of the posterior odds ratio, and that both are weaker
than (2.4). A special case is $p(\gamma) = 2^{-p}$ for all $\gamma$'s, which results in an
indifference prior distribution for $\gamma$; see, e.g., Smith and Kohn (1996).
To illustrate the relationship between PMC and (2.4), note that

$$ p(\gamma^0 \mid Z) = \left(1 + \sum_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z)\right)^{-1}, \tag{2.7} $$

and thus $p(\gamma^0 \mid Z) \to 1$ implies (2.4). When p is fixed, it has been noted by
Liang et al. (2008) that (2.4) implies PMC. However, when p grows with n,
it will be shown later that this may not be true, which illustrates
the difference between PMC and (2.4).
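The gap between uniformly small posterior odds and PMC can be seen from simple arithmetic, using only the fact that posterior probabilities sum to one, so that $p(\gamma^0 \mid Z) = 1/(1 + \sum_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z))$. The sketch below uses a purely hypothetical configuration in which every incorrect model has the same posterior odds `eps`; it is not a scenario taken from the paper.

```python
def true_model_posterior(odds):
    """p(gamma0 | Z) recovered from the posterior odds of all other models."""
    return 1.0 / (1.0 + sum(odds))


def equal_odds_example(p, eps):
    """Hypothetical case: all 2^p - 1 incorrect models have posterior odds eps.

    Returns (largest odds, posterior probability of the true model).
    """
    n_wrong = 2 ** p - 1
    return eps, 1.0 / (1.0 + n_wrong * eps)
```

With `eps = 1e-3`, three candidate variables leave the true model with posterior probability above 0.99, whereas twenty candidate variables push it below 0.001, even though the largest odds are the same in both cases. When p is fixed the number of terms in the sum is fixed, which is why the distinction only appears for growing p.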

In what follows, we introduce some regularity conditions that are useful 
to establish our main results. We will also demonstrate some particular 
situations when these conditions are satisfied. 

Assumption 2.1. There exists a constant $C_0 > 0$ such that for any n,
$\max_{\gamma \in S(n)} p(\gamma)/p(\gamma^0) \le C_0$.

Assumption 2.2. There exist positive constants $C_1$, $C_2$ such that, with
probability equal to one, $\liminf_n \varphi_{\min}(n) \ge C_1$ and $\limsup_n \varphi_{\max}(n) \le C_2$.

Assumption 2.3. There exists a positive sequence $\psi_n$ such that $\min_{j \in \gamma^0} |\beta_j^0| \ge \psi_n$
and, as $n \to \infty$, $\psi_n\sqrt{n} \to \infty$.

Assumption 2.4. $p_n \to \infty$, $s_n \le p_n \le n$ and $p_n \log n = o(n \log(1 + \min\{\psi_n^2, 1\}))$.

Assumption 2.5. $p_n \to \infty$, $s_n \le p_n \le n$ and $p_n \log p_n = o(n)$.

Hereafter, unless otherwise explicitly stated, we will drop the subscript n
from $p_n$.
Assumption 2.6. There is a positive sequence $\bar{\phi}_n = O(n^{\delta_0})$ for some $\delta_0 > 0$
such that $\max_{1 \le j \le p} c_j \le \bar{\phi}_n$, where the $c_j$'s are the hyperparameters (in model
(2.2)) controlling the prior variances of the nonzero $\beta_j$'s.

Assumption 2.7. There is a positive sequence $\underline{\phi}_n$ such that $k_n = O(\underline{\phi}_n)$
and $\min_{1 \le j \le p} c_j \ge \underline{\phi}_n$, where $k_n = \|\beta^0_{\gamma^0}\|^2$.

Assumption 2.8. There exist $C_3 > 0$ and $\delta > 0$ such that $n^{1-\delta}\underline{\phi}_n \to \infty$,
and for any n, with probability equal to one,

$$ \inf_{\gamma \in S_1} \lambda_-\left(\tfrac{1}{n} X^T_{\gamma \backslash \gamma^0}(I_n - P_{\gamma^0})X_{\gamma \backslash \gamma^0}\right) \ge C_3 n^{-\delta}. \tag{2.8} $$

Remark 2.1. 

(a). Assumption 2.1 is satisfied by some commonly used priors $p(\gamma)$, such
as the flat prior $p(\gamma) = 2^{-p}$ (Smith and Kohn, 1996). More generally,

if p(7j = 1) = 6'j is such that both Yl (l^) 11 \~T^) 

bounded, then Assumption 2.1 is satisfied.

(b). We use Assumption 2.3 to prove consistency for a growing p. Fan
and Peng (2004) introduced a similar assumption in the framework
of smoothly clipped absolute deviation (SCAD) penalized optimization,
where $\sqrt{n}$ in Assumption 2.3 was replaced by $1/\lambda_n$ with $\lambda_n$ the penalty
parameter. This condition requires the true parameters to be away from
zero. Otherwise, it is impossible to distinguish between zero and nonzero
parameters.

(c). Assumptions 2.4 and 2.5 define a rate on the dimension p. In particular,
when $\inf_n \psi_n > 0$, Assumption 2.4 is satisfied if $s_n \le p$ and $p \log n = o(n)$.
The results hold when $s_n$ is either bounded or growing with n.

(d). Assumption 2.6 excludes the possibility that $\bar{\phi}_n$ is extremely large, e.g.,
we exclude the situation that $\bar{\phi}_n = \exp(n^u)$ for some $u > 0$. Assumption
2.7 requires that $\underline{\phi}_n$ does not grow slower than $k_n = \|\beta^0_{\gamma^0}\|^2$. When
the design matrix X is nonorthogonal, we use this assumption to
facilitate the proof of consistency (see Theorem 2.2 below). But when X
is orthogonal, this assumption is redundant and can be removed (see
Corollary 2.5 below). □
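To see what a bounded prior odds ratio means in practice, consider an independent Bernoulli prior with $p(\gamma_j = 1) = \theta_j$. The prior odds $p(\gamma)/p(\gamma^0)$ then factor over the coordinates where $\gamma$ and $\gamma^0$ disagree. The sketch below is illustrative only; the function name is hypothetical.

```python
import math


def log_prior_odds(gamma, gamma0, theta):
    """log p(gamma)/p(gamma0) when the gamma_j are independent Bernoulli(theta_j)."""
    total = 0.0
    for g, g0, t in zip(gamma, gamma0, theta):
        if g == 1 and g0 == 0:
            # variable included in gamma but not in gamma0
            total += math.log(t / (1.0 - t))
        elif g == 0 and g0 == 1:
            # variable included in gamma0 but dropped from gamma
            total += math.log((1.0 - t) / t)
    return total
```

With $\theta_j = 1/2$ for all j (the flat prior) every log odds is zero, so the bound holds with constant 1. If instead some $\theta_j$ exceed 1/2 for variables outside the true model, each such variable multiplies the odds of the models that include it by $\theta_j/(1 - \theta_j) > 1$, so the largest prior odds grow with the number of such variables.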
Assumptions 2.1 and 2.3-2.7 are easily satisfied. The following proposition
demonstrates that a broad class of design matrices X can satisfy Assumptions
2.2 and 2.8.

Proposition 2.1. If the $n \times p$ matrix X satisfies $\lambda_-\left(\tfrac{1}{n}X^TX\right) \ge c$, where
$c > 0$ is constant, then for any $\gamma \subset \tilde{\gamma}$ with $\gamma \ne \tilde{\gamma}$,

$$ \lambda_-\left(\tfrac{1}{n} X^T_{\tilde{\gamma} \backslash \gamma}(I_n - P_\gamma)X_{\tilde{\gamma} \backslash \gamma}\right) \ge c. \tag{2.9} $$

The proof of Proposition 2.1 can be found in the Appendix.
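The inequality in Proposition 2.1 is a Schur-complement bound and can be checked numerically. The following sketch, with hypothetical function and variable names, computes the smallest eigenvalue of the Gram matrix of the added columns after they are residualized on the columns of the smaller model, and compares it with the smallest eigenvalue of $X^TX/n$ for a randomly drawn design.

```python
import numpy as np


def min_eig_residual_gram(X, gamma, gamma_tilde):
    """Smallest eigenvalue of X_d'(I - P_gamma)X_d / n, d = gamma_tilde minus gamma.

    Assumes gamma is nested in gamma_tilde and differs from it.
    """
    n = X.shape[0]
    gamma = np.asarray(gamma, dtype=bool)
    gamma_tilde = np.asarray(gamma_tilde, dtype=bool)
    Xd = X[:, gamma_tilde & ~gamma]
    Xg = X[:, gamma]
    if Xg.shape[1] > 0:
        # Residual of X_d after projecting onto the column space of X_gamma.
        coef, *_ = np.linalg.lstsq(Xg, Xd, rcond=None)
        Xd = Xd - Xg @ coef
    return np.linalg.eigvalsh(Xd.T @ Xd / n).min()


rng = np.random.default_rng(1)
n, p = 50, 6
X = rng.standard_normal((n, p))
c = np.linalg.eigvalsh(X.T @ X / n).min()          # lower bound on the spectrum
lam = min_eig_residual_gram(X, (1, 0, 1, 0, 0, 0), (1, 1, 1, 0, 1, 0))
```

Because $I_n - P_\gamma$ is a symmetric projection, the Gram matrix of the residualized columns equals $X_d^T(I_n - P_\gamma)X_d$, so no explicit $n \times n$ projection matrix is needed.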

Remark 2.2. Proposition 2.1 demonstrates that Assumptions 2.2 and
2.8 can hold under general classes of design matrices. One such class consists
of matrices X satisfying

$$ 1/c \le \lambda_-\left(\tfrac{1}{n}X^TX\right) \le \lambda_+\left(\tfrac{1}{n}X^TX\right) \le c, \tag{2.10} $$

where c is some positive constant. For any $\gamma \in S_1$, we have $\gamma^0 \subset \gamma$
and $\gamma^0 \ne \gamma$. Thus, by Proposition 2.1, $\lambda_-\left(\tfrac{1}{n} X^T_{\gamma \backslash \gamma^0}(I_n - P_{\gamma^0})X_{\gamma \backslash \gamma^0}\right) \ge 1/c$,
i.e., inequality (2.8) in Assumption 2.8 holds. Notice that when $\gamma \in S_2$, the
relationship $\gamma \subset \gamma^0 \vee \gamma$ and $\gamma \ne \gamma^0 \vee \gamma$ holds, where $\gamma^0 \vee \gamma$ denotes the
p-vector with jth component the larger of $(\gamma^0)_j$ and $\gamma_j$; Assumption 2.2 then
follows by applying Proposition 2.1. □

In the following text, we assume that data are generated from the true
model $y = X\beta^0 + \epsilon$ with $\epsilon \sim N(0, \sigma_0^2 I_n)$. Let $\gamma^0$ be the p-dimensional state
vector corresponding to $\beta^0$. Unless otherwise stated, the limits in our main
results are taken as $n \to \infty$.

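Theorem 2.2. Suppose that $\gamma^0$ is nonnull and Assumptions 2.1-2.4 and
2.6-2.8 are satisfied. Let $\delta > 0$ satisfy Assumption 2.8. If $p^{\alpha_0} = o(n^{1-\delta}\underline{\phi}_n)$
for some $\alpha_0 > 2$, then

$$ \sup_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0 $$

in probability. If $p^{\alpha_0 + 2} = o(n^{1-\delta}\underline{\phi}_n)$ for some $\alpha_0 > 2$, then

$$ \sup_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} \sum_{\gamma \ne \gamma^0} p(\gamma \mid Z) \to 0 $$

in probability, and consequently, $\inf_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} p(\gamma^0 \mid Z) \to 1$ in probability.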

The proof of Theorem 2.2 follows by first deriving asymptotic approximations
of the posterior odds ratios $p(\gamma \mid Z)/p(\gamma^0 \mid Z)$ for any $\gamma \ne \gamma^0$, and
then using these approximations to show that $\sum_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0$ in
probability. The limit $p(\gamma^0 \mid Z) \to 1$ (in probability) then follows immediately
from (2.7). Details are in the Appendix.

Remark 2.3. Theorem 2.2 provides sufficient conditions under which
(2.4) and PMC are satisfied. It asserts that, with large probability, uniformly
for $c_j$'s in $[\underline{\phi}_n, \bar{\phi}_n]$, $p(\gamma^0 \mid Z)$ dominates $p(\gamma \mid Z)$ for any $\gamma \ne \gamma^0$, and $p(\gamma^0 \mid Z)$
approaches one in probability. Thus, with large probability, the true model
$\gamma^0$ will be selected from a Bayesian perspective. □

Remark 2.4. When combined with certain dimension reduction techniques
such as sure independence screening (SIS) proposed by Fan and Lv
(2008), one can generalize Theorem 2.2 to the ultra-high dimensional setting,
i.e., $p \gg n$. This framework has been explored by many authors from
non-Bayesian perspectives (see, e.g., Meinshausen and Bühlmann, 2006;
Meinshausen and Yu, 2009; Zhang and Huang, 2010; Bühlmann and Kalisch,
2010). Here, we explore it from a Bayesian perspective. The basic idea is to first
reduce the high-dimensional linear model so that the model dimension is
below n, and then apply Bayesian model (2.2) to this reduced linear model.
Under suitable conditions, and using arguments similar to the proof of
Theorem 2.2, one can show that the posterior probability of the true model
based on the reduced linear model converges in probability to 1. We refer to
Supplement A for the description of this result and details of the proof. □

The following result is an application of Theorem 2.2 in a special setting,
which allows the growth rate of p to be $p \log n = o(n)$.

Corollary 2.3. Suppose that $\gamma^0$ is nonnull and Assumptions 2.1, 2.2 and
inequality (2.8) are satisfied. Assume that $\min_{j \in \gamma^0} |\beta_j^0| \ge \psi_n$ with $\inf_n \psi_n > 0$,
and p satisfies $p \log n = o(n)$. Let $\delta > 0$ be as specified in inequality (2.8)
and suppose there exists a constant 6q with 6q > 3 + 6 such that kn = 
0{n^°). Then with the selection 0„ = 0{n^°) and = 0{(j)^), we have 

$$ \inf_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} p(\gamma^0 \mid Z) \to 1 \quad \text{in probability.} $$

The proof of Corollary 2.3 can be completed by choosing $\alpha_0 \in (2, \delta_0 - \delta - 1)$
and verifying the assumptions in Theorem 2.2.

Theorem 2.2 deals with the case when the true model is nonnull. If the
true model is null, then the response vector y has zero mean. The
corresponding result is summarized below.
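Theorem 2.4. Suppose $\gamma^0$ is null, i.e., $y = \epsilon \sim N(0, \sigma_0^2 I_n)$, and that
Assumptions 2.1 and 2.5-2.8 are satisfied. If $p^{\alpha_0} = o(n^{1-\delta}\underline{\phi}_n)$ for some
$\alpha_0 > 2$, then

$$ \sup_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0 $$

in probability. If $p^{\alpha_0 + 2} = o(n^{1-\delta}\underline{\phi}_n)$ for some $\alpha_0 > 2$, then

$$ \sup_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} \sum_{\gamma \ne \gamma^0} p(\gamma \mid Z) \to 0 $$

in probability, and consequently, $\inf_{c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]} p(\gamma^0 \mid Z) \to 1$ in probability.

The proof of Theorem 2.4 is similar to that of Theorem 2.2 and can be found in
the Appendix.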

Although it is valid for a general type of design matrix, Theorem 2.2
requires that p grows slower than n. More precisely, if the assumptions
in Theorem 2.2 are satisfied, then $p = o(n)$. To see this, we notice that
Assumptions 2.6, 2.7 and the fact that $\psi_n \le k_n^{1/2}$ lead to $\psi_n = O(n^{\delta_0})$ for
some $\delta_0 > 0$. Therefore, $p = o(n)$ follows from Assumption 2.4. In order
to obtain consistency when p may grow as fast as n, one idea, but not the
weakest possible, is to assume orthogonality of X, i.e., $X^TX = nI_p$, and
to relax Assumption 2.7. To simplify the technical proof, we assume in the
following Corollaries 2.5 and 2.6 that all $c_j$'s in model (2.2) are equal to $\phi_n$.
Moreover, we need the following assumption about the growth rates of $s_n$
and p to replace Assumptions 2.4 and 2.5.

Assumption 2.9. Let a„ = + a^'^kn/ {n~^ + (/)„) and C G (1, C)o) be a 
constant such that mp^ > CTgCan as n — )■ 00. The numbers p and s„ with 
p — )■ 00 and Sn < p < n satisfy 

«• = o (min { , nr^,n}) . 

(ii). plogp = o{an). 

Assumption 2.9 potentially allows the case $p = n$. To see this, suppose
$s_n = O(1)$ and we choose $\phi_n$ such that $(n + \nu)/\log(1 + n\phi_n) \to \infty$. When $a_n$
grows faster than $n \log n$ and $n\psi_n^2/a_n \to \infty$, $p = n$ will satisfy Assumption
2.9. However, this requires $\psi_n^2$ to grow faster than $\log n$. This extra
requirement on $\psi_n^2$ has not been imposed by Theorems 2.2 and 2.4, and can
be treated as the price we pay to relax the growth rate for p. Under
Assumption 2.9 and assuming orthogonality of X, we have the following
consistency result, which allows a faster growth rate for the dimension p.
Corollary 2.5. Assume that $X^TX = nI_p$ and $\Sigma = \phi_n I_p$ with $n\phi_n \to \infty$
and $\log \phi_n = O(\log n)$. Suppose $\gamma^0$ is nonnull and that Assumptions
2.1 and 2.9 are satisfied. If $p^{\alpha_0 (n+\nu)/a_n} = o(n\phi_n)$ for some $\alpha_0 > 2$, then

max p{'~)\Z) / p{'~^^\Z) — )■ in probability. \i p = o {^n + z/) log ^ ^"^^ with 

C specified in Assumption 12. 9[ and p2+oo(n+!^)/an _ o{n(f)n) for some ao > 2, 
then $\sum_{\gamma \ne \gamma^0} p(\gamma \mid Z) \to 0$ in probability, and consequently, $p(\gamma^0 \mid Z) \to 1$ in
probability.

The proof of Corollary 2.5 is similar to those of Theorems 2.2 and 2.4
and is given in Supplement B. The following result, which requires a special
model set-up, demonstrates that PMC and consistency of the posterior odds
ratio may hold in some situations but fail in others.

Corollary 2.6. Assume $p = n$, $X^TX = nI_n$ and $\Sigma = \phi_n I_n$. Suppose

min|/3°| > with ip"^ = Cin^+^^ (logn)^ for some constants 61 > 1 and 

ie7° 

$c_1 > 0$, $k_n = O(\psi_n^2)$ and $p(\gamma)$ = constant for all $\gamma$. Assume that $s_n = s$ with
$s > 0$ a fixed integer, i.e., the true parameter vector contains exactly s
nonzero components.

(a). Suppose $\phi_n = c_2 n^{\delta_2}$ for some constants $c_2 > 0$ and $\delta_2$.

i. If $-1 < \delta_2 \le 1$, then $\max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0$ in probability, but
PMC does not hold. Specifically, when $-1 < \delta_2 < 1$, $p(\gamma^0 \mid Z) \to 0$
a.s.; when $\delta_2 = 1$, there exists a constant $c_0$ with $0 < c_0 < 1$
such that $\limsup_n p(\gamma^0 \mid Z) \le c_0$ a.s.

ii. If $1 < \delta_2 \le \delta_1$, then $p(\gamma^0 \mid Z) \to 1$ in probability.

(b) . If = O(0„), then p(0|Z)/p(7°|Z) ^ 00 in probability, where 

represents the null model. Therefore, p{'y^\Z) — )■ in probability. 

(c). If $n\phi_n \to \eta \in [0, \infty)$, then almost surely $\liminf_n \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \ge
(1 + \eta)^{-1/2}$ and $\lim_n p(\gamma^0 \mid Z) = 0$.

The proof of Corollary 2.6 is given in Supplement B.
Remark 2.5. The main contribution of Corollary 2.6 is to demonstrate
the difference between PMC and (2.4), and to provide example growth rates for
$\phi_n$ under which the two forms of consistency fail. Although this is obtained
in a special situation, similar results should still be true under a more general
setting, for instance, where p < n or $X^TX$ is not diagonal, but we do not
consider those circumstances here.

Corollary 2.6 (a) demonstrates that (2.4) does not necessarily imply PMC.
This means that, although the posterior probability of the true model might
not approach one, the ratio of the posterior probabilities of any "incorrect"
model and the true model can still converge to zero. This phenomenon
does not occur when p is fixed. In practice, (2.4) is sufficient to make a correct
model selection even if PMC fails.

Corollary 2.6 (b) and (c) demonstrate that, in order to make a correct
model selection, $\phi_n$ can be neither too small nor too large. Specifically,
when $\phi_n = o(n^{-1})$, it follows by Corollary 2.6 (c) that almost surely
$\liminf_n \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \ge 1$. Thus, with probability one, for any $\epsilon > 0$,
there exists an integer N such that for any $n \ge N$,

$$ \max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \ge 1 - \epsilon. $$

This implies that there exists a model, say $\gamma^*$, such that $p(\gamma^* \mid Z) \ge (1 - \epsilon)p(\gamma^0 \mid Z)$.
Thus, when $\epsilon$ is small, either $p(\gamma^* \mid Z) \ge p(\gamma^0 \mid Z)$, or $p(\gamma^* \mid Z)$ is
very close to $p(\gamma^0 \mid Z)$; either case affects the selection result. On the
other hand, when $\phi_n$ grows fast enough for the condition in part (b) to hold,
it follows from (b) that the null model will be preferred over $\gamma^0$.

Corollary 2.6 (b) and (c) can also be understood intuitively. When $\phi_n$ is
too small, the two components in the mixture prior of $\beta$ tend
to be indistinguishable, so that it is difficult to separate the true model from
some incorrect model; when $\phi_n$ approaches infinity, by (2.3), the posterior
probability of any nonnull model approaches zero, and thus all $\beta_j$'s are forced
to be zero. This conclusion has been empirically obtained by Smith and Kohn
(1996) under spline regression models. □

Remark 2.6. Using arguments similar to the proofs of Theorems 2.2
and 2.4 and the Borel-Cantelli lemma (see Shao, 2003), one can show the
almost sure convergence of $p(\gamma^0 \mid Z)$. We refer to Supplement C for details. □

To conclude this section, let us look at an example which demonstrates
that, when $\bar{\phi}_n = \bar{\phi}$ and $\underline{\phi}_n = \underline{\phi}$ with $\bar{\phi}$ and $\underline{\phi}$ unrelated to n, consistency
might still hold under certain circumstances. This is motivated by a full
Bayesian framework, which requires all hyperparameters to be fixed.
Example 2.1. If a full Bayesian approach is desired, then we have to
preselect the hyperparameters $c_j$'s, and so $\underline{\phi}_n = \underline{\phi}$ and $\bar{\phi}_n = \bar{\phi}$ could be
fixed. Assume that $k_n = O(1)$, which is a slightly weaker assumption than
that in Jiang (2007). Note that Assumptions 2.6 and 2.7 follow immediately.
Suppose $\min_{j \in \gamma^0} |\beta_j^0| \ge \psi_n$ with $\psi_n \propto n^{-1/4}\sqrt{\log n}$, and that the prior distribution of
model $\gamma$ satisfies Assumption 2.1. Assume that $s_n = s$ with $s > 0$ a fixed
integer (thus, the true model is nonnull), and that the design matrix X satisfies (2.10).
Therefore, by Proposition 2.1 and Remark 2.2, Assumptions 2.2 and 2.8 both
hold. We also notice that Assumption 2.3 is satisfied. It follows from
Theorem 2.2 that if $p \propto n^r$ for some $0 < r < 1/2$, then with probability
approaching one, (2.4) holds, i.e., the true model can be correctly selected;
if $p \propto n^r$ for some $0 < r < 1/4$, then PMC holds in probability.
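The two exponent ranges can be traced back to the rate conditions of Theorem 2.2 by a short calculation. The sketch below assumes those conditions read $p^{\alpha_0} = o(n^{1-\delta}\underline{\phi}_n)$ and $p^{\alpha_0+2} = o(n^{1-\delta}\underline{\phi}_n)$ with $\alpha_0 > 2$, that $\underline{\phi}_n = \underline{\phi}$ is a fixed constant, and that $\delta > 0$ may be taken arbitrarily small because the eigenvalue bound (2.10) does not depend on n.

```latex
\begin{align*}
p \propto n^{r},\ \underline{\phi}_n = \underline{\phi}
  &\;\Longrightarrow\; p^{\alpha_0} \asymp n^{r\alpha_0}, \\
p^{\alpha_0} = o\!\left(n^{1-\delta}\underline{\phi}\right)
  &\iff r\alpha_0 < 1 - \delta, \\
\exists\,\alpha_0 > 2,\ \delta > 0:\ r\alpha_0 < 1 - \delta
  &\iff r < \tfrac{1}{2}, \\
\exists\,\alpha_0 > 2,\ \delta > 0:\ r(\alpha_0 + 2) < 1 - \delta
  &\iff r < \tfrac{1}{4}.
\end{align*}
```

The extra factor $p^2$ in the second condition is what halves the admissible exponent when moving from dominance of the posterior odds to PMC.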

3. Generalizations to g-prior settings

In Section 2, we assume in the Bayesian model (2.2) that the prior variance
of a nonzero $\beta_j$ is $c_j\sigma^2$ with $c_j$ fixed a priori. In practice, one may
consider placing a prior distribution $g(c)$ on the $c_j$'s, which leads to the
so-called g-prior setting (see Zellner 1986; Liang et al. 2008). In this section,
we give some asymptotic results under a g-prior setting.

We consider the following variation of model (2.2):

$$ \beta_j \mid \gamma_j, \sigma^2, c \sim (1 - \gamma_j)\delta_0 + \gamma_j N(0, c\sigma^2), \quad j = 1, \ldots, p, \qquad c \sim g(c), $$

where g is a proper prior distribution on $[0, \infty)$. We still use $p(\gamma^0 \mid Z)$ to denote
the posterior probability of the true model. Note that $p(\gamma \mid Z)$ is obtained by
integrating $p(\gamma, \beta, \sigma^2, c \mid Z)$ with respect to $(\beta, \sigma^2, c)$.

Theorem 3.1. Suppose that $\gamma^0$ is nonnull and Assumption 2.1 holds.
Furthermore, suppose $\|\beta^0\|^2 = O(1)$, $s_n = O(1)$, $\min_{j \in \gamma^0} |\beta_j^0| \ge \psi_n$ with
$\psi_n \propto n^{-1/4}\sqrt{\log n}$, and the design matrix X satisfies property (2.10).

(i) Let the support of g be $[\underline{\phi}, \bar{\phi}]$ with $0 < \underline{\phi} < \bar{\phi} < \infty$. If $p \propto n^r$ for some
$0 < r < 1/2$, then $\max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0$ in probability.

(ii) Let g be proper on $[0, \infty)$. If $p \propto n^r$ for some $0 < r < 1/4$, then
$p(\gamma^0 \mid Z) \to 1$ in probability.
Theorem 3.2. Suppose that $\gamma^0$ is nonnull and Assumptions 2.1, 2.3 and
2.4 are satisfied. Let X satisfy (2.10). Suppose that $\underline{\phi}_n$ and $\bar{\phi}_n$ satisfy
Assumptions 2.6 and 2.7.

(i) Let the support of g be $[\underline{\phi}_n, \bar{\phi}_n]$. If $p^{\alpha_0} = o(n\underline{\phi}_n)$ for some $\alpha_0 > 2$, then
$\max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z) \to 0$ in probability.

(ii) Let g be proper on $[0, \infty)$ such that $1 - \int_{\underline{\phi}_n}^{\bar{\phi}_n} g(c)\,dc = o(1)$. If $p^{\alpha_0 + 2} =
o(n\underline{\phi}_n)$ for some $\alpha_0 > 2$, then $p(\gamma^0 \mid Z) \to 1$ in probability.

The proofs of Theorems 3.1 and 3.2 are given in the Appendix.

Remark 3.1. Theorems 3.1 and 3.2 provide sufficient conditions for
(2.4) and PMC under a g-prior setting. They state that, with large probability,
$p(\gamma^0 \mid Z)$ dominates $p(\gamma \mid Z)$ for any $\gamma \ne \gamma^0$, and $p(\gamma^0 \mid Z)$ approaches one
in probability. In particular, the prior g in Theorem 3.1 does not depend on
n, which corresponds to a full Bayesian framework, but we need to impose a
narrow restriction on the growth rate of p, namely, that p grows slower
than $n^{1/4}$ or $n^{1/2}$, corresponding to PMC or (2.4), respectively. In Theorem 3.2,
g might depend on n, but we can allow p to grow faster with n. □

Remark 3.2. We conjecture, although we do not rigorously prove, that the
ranges $0 < r < 1/2$ and $0 < r < 1/4$ in parts (i) and (ii) of Theorem 3.1 are
optimal, in the sense that for any $r > 1/2$, if $p \propto n^r$, then $\max_{\gamma \ne \gamma^0} p(\gamma \mid Z)/p(\gamma^0 \mid Z)$
does not converge to zero in probability; and for any $r > 1/4$, if $p \propto n^r$, then
$p(\gamma^0 \mid Z)$ does not converge to one in probability.

Remark 3.3. Liang et al. (2008) obtained model consistency under a
mixture g-prior setting. Their proof relies on the Laplace approximation of
the integrals. In contrast, the proofs of both Theorems 3.1 and 3.2 rely on the
uniform convergence in Theorem 2.2. □

4. Numerical results 

This paper has been concerned with asymptotic properties of Bayesian
posterior probabilities. In this section, we briefly explore the finite sample
behavior of the model selection procedure for a few different prior settings and
different rates of growth for p. Our basic approach is to simulate observations
from model (1.1), employ the model selection process, and summarize the
results.

To construct random design matrices X, we generated iid p-dimensional
row vectors $U_1, \ldots, U_n \sim N(0, I_p)$ and let U be an $n \times p$ matrix with ith row
$U_i$ for $i = 1, \ldots, n$. Then we let $X = \sqrt{n}\,U(U^TU)^{-1/2}$. Thus, $X^TX = nI_p$.
(We choose X to have orthogonal columns for purposes of illustration, although, as
we saw in the preceding material, results can be derived for general X.) To
explore the dimension effect, we have considered three growth rates for p
with respect to n: (1) p = n^^^, (2) p = rt}!'^ and (3) p = v?^'^ . Data were 
simulated from model (2.1) with $\sigma = 1$, $s_n = 2$ and the true model coefficients
$(\beta_1^0, \beta_2^0) = (2, 2)$ and $(\beta_3^0, \ldots, \beta_p^0) = (0, \ldots, 0)$. We considered sample sizes
n = 100, 200 and 400.
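One way to generate such a design is sketched below. It is not the authors' code: it assumes the orthogonalization takes the form $X = \sqrt{n}\,U(U^TU)^{-1/2}$, which gives $X^TX = nI_p$ exactly, and it uses hypothetical function names and an illustrative value of p.

```python
import numpy as np


def orthogonal_design(n, p, rng):
    """Return X = sqrt(n) * U (U'U)^(-1/2), so that X'X = n I_p (needs p <= n)."""
    U = rng.standard_normal((n, p))             # rows are iid N(0, I_p)
    evals, evecs = np.linalg.eigh(U.T @ U)      # U'U is positive definite a.s.
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.sqrt(n) * U @ inv_sqrt


def simulate(n, p, rng, sigma=1.0):
    """Draw (X, y) with two nonzero coefficients equal to 2 and the rest zero."""
    X = orthogonal_design(n, p, rng)
    beta0 = np.zeros(p)
    beta0[:2] = 2.0
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0


rng = np.random.default_rng(0)
X, y, beta0 = simulate(n=100, p=10, rng=rng)    # p = 10 is an arbitrary choice
```

The symmetric inverse square root is computed from an eigendecomposition, which is numerically stable when $U^TU$ is well conditioned, as is typical when p is much smaller than n.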

The hierarchical Bayesian model (2.2) was fitted, and the prior distributions
on $\sigma^2$ and $\gamma$ were taken to be a chi-square distribution for $1/\sigma^2$ and $p(\gamma_j = 1) = w_j$
for any $j = 1, \ldots, p$. We examined two cases for the $w_j$'s, namely, Case I:
$w_j = 0.5$ for $1 \le j \le p$; and Case II: $w_1 = w_2 = 0.3$, $w_3 = \ldots = w_p = 0.7$.
Case I places equal prior probabilities on all the models, while Case II places
larger prior probabilities on the "incorrect" models. For simplicity, we let
$c_1 = \ldots = c_p = \phi_n$. The values of $\phi_n$ were chosen to be $\phi_n = 10, 100, 1000$.
A total of 20,000 samples of $(\beta, \gamma, \sigma)$ were drawn from the posterior
distribution $p(\beta, \gamma, \sigma \mid Z)$ using a sub-blockwise Gibbs sampler developed by Godsill
and Rayner (1998). We recorded the last 10,000 samples and treated the
first 10,000 samples as burn-in. Convergence was assessed by applying
the Gelman-Rubin statistic to 5 parallel Markov chains for each $\phi_n$. If we
denote by $\gamma^{(1)}, \ldots, \gamma^{(10000)}$ the last 10,000 samples of $\gamma$, then $p(\gamma^0 \mid Z)$ is
approximated by

$$ p(\gamma^0 \mid Z) \approx \sum_{t=1}^{10000} I(\gamma^{(t)} = \gamma^0)/10000. $$

t=i 
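
The Monte Carlo approximation above is simply the fraction of retained draws that coincide with the true model. A minimal sketch follows; the function name, the representation of each draw as a tuple of 0/1 inclusion indicators, and the toy chain are assumptions made for illustration and are not taken from the sampler used in the study.

```python
def posterior_model_probability(gamma_draws, gamma_true, burn_in=0):
    """Fraction of retained draws whose inclusion vector equals gamma_true."""
    kept = gamma_draws[burn_in:]
    if not kept:
        raise ValueError("no draws left after burn-in")
    target = tuple(gamma_true)
    hits = sum(1 for draw in kept if tuple(draw) == target)
    return hits / len(kept)


if __name__ == "__main__":
    # Toy chain with p = 3; the true model contains the first two variables.
    draws = [(0, 0, 0), (1, 0, 0), (1, 1, 0),
             (1, 1, 0), (1, 1, 1), (1, 1, 0)]
    # The first two draws are discarded as burn-in.
    print(posterior_model_probability(draws, (1, 1, 0), burn_in=2))  # 0.75
```

In the setting of this section one would pass the 20,000 sampled inclusion vectors with `burn_in=10000`.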

To study the frequentist behavior of $p(\gamma^0|Z)$, we have generated 100 data
sets $Z_1, \ldots, Z_{100}$ independently from model (2.1), and for each
$\phi_n$ calculated the corresponding 100 posterior probabilities
$p(\gamma^0|Z_m)$, $m = 1, \ldots, 100$, as described in the preceding
paragraph. This idea was inspired by Fernandez et al. (2001), who studied the
Bayesian selection problem when $p$ is fixed.
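
The replication summary can be sketched as follows. The function name is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which version was used.

```python
import statistics


def summarize_replications(probabilities):
    """Mean and sample standard deviation of replicated posterior probabilities."""
    if len(probabilities) < 2:
        raise ValueError("need at least two replications")
    return statistics.mean(probabilities), statistics.stdev(probabilities)


if __name__ == "__main__":
    # Three made-up values standing in for the 100 replicated probabilities.
    mean, std = summarize_replications([0.9, 1.0, 0.8])
    print(round(mean, 2), round(std, 2))
```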

Table 1 summarizes the means and standard deviations of the 100
$p(\gamma^0|Z_m)$'s. We compared four settings. Specifically, Settings 1 to 3
correspond to $\phi_n = 10, 100, 1000$ under the Bayesian model (2.2), and
Setting 4 uses a hyper-$g$ prior with tuning parameter 3 (see Liang et al.
2008). Setting 4 was performed using the R package BAS, available from
http://www.stat.duke.edu/~clyde/BAS.
We observe that when $p = n^{1/4}$, all four settings select the true model with
high posterior probability. For the faster growth rates $p = n^{1/2}$ and
$p = n^{3/4}$, the results are more mixed. Generally, Setting 1 performs the
worst and Setting 3 performs the best. In summary, when $p$ is small compared
to $n$, fixing $\phi_n$ to be 10, 100 or 1000 gives equally good results; when
$p$ is larger (compared to $n$), we recommend using $\phi_n = 1000$ under model
(2.2) if good asymptotic behavior is of interest.

5. Conclusion 

Previous work about posterior model consistency (PMC) includes Fernandez
et al. (2001) and Liang et al. (2008) when the number of parameters $p$ is
fixed. In this paper, we have studied PMC when the model dimension $p$
grows with sample size $n$. Specifically, we have shown that, under a variation
of the Bayesian model proposed by George and McCulloch (1993), the
posterior probability of the true model converges to one, i.e., PMC holds.
We have obtained this result in two situations: (i) the design matrix $X$ is
general while $p$ grows slower than $n$, e.g., $p \log n = o(n)$; (ii)
$X^T X/n$ is the identity matrix and $p$ may grow as fast as $n$, e.g.,
$p = n$. Furthermore, we have demonstrated under a special framework that the
consistency results may fail if $\phi_n$ is too small or too large, where
$\phi_n$ is the hyperparameter controlling the prior variance of the nonzero
model coefficients. More precisely,
when 0„ = o{n~^) (an example of small order) or when n"'"^" = O(0„) (an 
example of large order), both PMC and consistency of the posterior odds 
ratio fail. Besides that, our results do not require that the candidate models 
are pairwise nested. 
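
For reference, the two notions of consistency mentioned above can be written side by side. The symbol $BF(\gamma, \gamma^0)$ for the Bayes factor is shorthand introduced here for illustration; both relations are standard identities and do not depend on the assumptions of this paper.

```latex
\frac{p(\gamma \mid Z)}{p(\gamma^0 \mid Z)}
  = BF(\gamma, \gamma^0)\,\frac{p(\gamma)}{p(\gamma^0)},
\qquad
\max_{\gamma \neq \gamma^0} \frac{p(\gamma \mid Z)}{p(\gamma^0 \mid Z)}
  \le \sum_{\gamma \neq \gamma^0} \frac{p(\gamma \mid Z)}{p(\gamma^0 \mid Z)}
  = \frac{1 - p(\gamma^0 \mid Z)}{p(\gamma^0 \mid Z)}.
```

The second relation shows that $p(\gamma^0|Z) \to 1$ forces the largest posterior odds ratio against the true model to vanish. The sum involves up to $2^p - 1$ terms, so it can remain large even when every individual term is small.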

Berger et al. (2003), Moreno et al. (2010) and Giron et al. (2010) have
proved the consistency of the Bayes factor when $p$ is growing with $n$. This
form of consistency, under our framework, is equivalent to the consistency of
the posterior odds ratio if the prior odds ratio is uniformly bounded from above
and below, so it is of interest to illustrate the relationship between PMC and
consistency of the posterior odds ratio. We have considered a special framework
and shown that PMC implies consistency of the posterior odds ratio but
the reverse may not be true. This is different from the finding by Liang
et al. (2008), who demonstrate the equivalence of PMC and consistency of
the Bayes factor when $p$ is fixed. When combined with dimension reduction
procedures such as SIS (Fan and Lv, 2008), our results can also be extended
to ultrahigh-dimensional situations. We have also generalized the consistency
results to a $g$-prior setting studied by Zellner (1986) and Liang et al. (2008).

                                      n = 100        n = 200        n = 400
                                    mean   std     mean   std     mean   std

p = n^{1/4}   Case I    Setting 1   0.94   0.05    0.96   0.04    0.92   0.10
                        Setting 2   0.98   0.02    0.99   0.02    0.97   0.05
                        Setting 3   0.99   0.01    0.99   0.01    0.99   0.02
                        Setting 4   0.96   0.04    0.96   0.05    0.95   0.05

              Case II   Setting 1   0.86   0.14    0.91   0.08    0.86   0.10
                        Setting 2   0.94   0.10    0.97   0.04    0.95   0.05
                        Setting 3   0.98   0.05    0.99   0.02    0.98   0.02
                        Setting 4   0.92   0.05    0.94   0.05    0.89   0.08

p = n^{1/2}   Case I    Setting 1   0.60   0.14    0.56   0.14    0.53   0.12
                        Setting 2   0.82   0.10    0.81   0.11    0.80   0.10
                        Setting 3   0.94   0.05    0.93   0.06    0.93   0.05
                        Setting 4   0.68   0.10    0.63   0.13    0.62   0.12

              Case II   Setting 1   0.34   0.12    0.29   0.11    0.27   0.11
                        Setting 2   0.65   0.14    0.63   0.14    0.62   0.14
                        Setting 3   0.86   0.09    0.85   0.10    0.84   0.09
                        Setting 4   0.42   0.10    0.41   0.10    0.36   0.10

p = n^{3/4}   Case I    Setting 1   0.14   0.07    0.07   0.04    0.04   0.03
                        Setting 2   0.47   0.13    0.38   0.12    0.33   0.10
                        Setting 3   0.77   0.10    0.71   0.13    0.68   0.11
                        Setting 4   0.21   0.08    0.16   0.05    0.16   0.06

              Case II   Setting 1   0.02   0.01    0.00   0.00    0.00   0.00
                        Setting 2   0.20   0.10    0.13   0.06    0.08   0.05
                        Setting 3   0.55   0.15    0.48   0.12    0.41   0.12
                        Setting 4   0.04   0.02    0.03   0.02    0.04   0.02

Table 1: Means and standard deviations of the 100 $p(\gamma^0|Z_m)$'s. Settings
1 to 3 correspond to $\phi_n = 10, 100, 1000$ under the Bayesian model (2.2),
and Setting 4 uses the hyper-$g$ prior with tuning parameter 3.

We close with an observation about extending the current results. Assumption
2.7 is a technical assumption used to facilitate the proof and may
not be the weakest possible. We leave it to future work to determine whether
this condition can be further weakened or even removed.



6. Appendix: proofs 

In this section, we prove the main results in Section 2. We also prove
some lemmas which are useful to establish the main results. Let $pr(\cdot)$
denote the probability measure associated with the underlying probability space.

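Proof of Proposition 2.1. It follows by assumption that
$\frac{1}{n} X_\gamma^T X_\gamma \ge c I_{|\gamma|}$. Letting
$X_\gamma = (X_{\gamma^0}, X_{\gamma \setminus \gamma^0})$, we can write

$$\frac{1}{n} X_\gamma^T X_\gamma = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix},$$

where $A = X_{\gamma^0}^T X_{\gamma^0}/n$,
$B = X_{\gamma^0}^T X_{\gamma \setminus \gamma^0}/n$ and
$C = X_{\gamma \setminus \gamma^0}^T X_{\gamma \setminus \gamma^0}/n$. By the
formula for the inverse of a block matrix (Seber and Lee, 2003, page 466), the
lower right corner of $(\frac{1}{n} X_\gamma^T X_\gamma)^{-1}$ is $B_{22}^{-1}$
with $B_{22} = C - B^T A^{-1} B = \frac{1}{n} X_{\gamma \setminus \gamma^0}^T
(I_n - P_{\gamma^0}) X_{\gamma \setminus \gamma^0}$.
Then $B_{22}^{-1} \le c^{-1} I$, which implies $\lambda_-(B_{22}) \ge c$. □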

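Lemma 6.1. Suppose $e \sim N(0, \sigma_0^2 I_n)$. Then:

(a) Let $v_\gamma = (I_n - P_\gamma) X_{\gamma^0 \setminus \gamma}
\beta^0_{\gamma^0 \setminus \gamma}$. If $S_2$ is nonnull, then
$\max_{\gamma \in S_2} |v_\gamma^T e| / \|v_\gamma\|_2 = O_p(\sqrt{p})$, where
we adopt the convention that $|v_\gamma^T e| / \|v_\gamma\|_2 = 0$ when
$v_\gamma = 0$.

(b) If $S_1$ is nonnull, then for any $\alpha > 2$, with probability
approaching one,
$\max_{\gamma \in S_1} e^T (P_\gamma - P_{\gamma^0}) e / (|\gamma| - s_n)
\le \alpha \sigma_0^2 \log p$.

(c) If $S_2$ is nonnull, and we adopt the convention that
$e^T P_\gamma e / |\gamma| = 0$ when $\gamma$ is null, then for any
$\alpha > 2$, with probability approaching one,
$\max_{\gamma \in S_2} e^T P_\gamma e / |\gamma| \le \alpha \sigma_0^2 \log p$.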

Proof of Lemma 6.1. We prove the result for the case where $X$ is
deterministic, and briefly discuss the proofs for the case where $X$ is random
and independent of $e$.

(a) We first assume that $X$ is deterministic. By inequality (9.3) in Durrett
(2005), if $\xi \sim N(0, 1)$, then there exists a $C_0$ such that for any
$t > 1$, $pr(|\xi| > t) \le C_0 \exp(-t^2/2)$. Note that
$v_\gamma^T e / (\sigma_0 \|v_\gamma\|_2) \sim N(0, 1)$, and therefore, by
Bonferroni's inequality,

$$pr\Big( \max_{\gamma \in S_2} \frac{|v_\gamma^T e|}{\|v_\gamma\|_2} > t \Big)
\le \sum_{\gamma \in S_2} pr\Big( \frac{|v_\gamma^T e|}{\|v_\gamma\|_2} > t \Big)
\le C_0 2^p \exp\Big( -\frac{t^2}{2\sigma_0^2} \Big).$$

Then the result holds by setting $t = C \sigma_0 \sqrt{p}$ with large $C$. When
$X$ is random but independent of $e$, note that the conditional distribution of
$v_\gamma^T e / (\sigma_0 \|v_\gamma\|_2)$ given $X$ is $N(0, 1)$. Thus, the
proof can be finished by the above arguments.

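(b) Suppose $X$ is deterministic. First, if $\xi \sim \chi^2_\mu$, then by
Chebyshev's inequality, for any $2 < \alpha' < \alpha$,

$$\begin{aligned}
pr(\xi > \alpha \mu \log p)
&= pr\big( \exp(\xi/\alpha') > \exp((\alpha/\alpha') \mu \log p) \big) \\
&\le \exp(-(\alpha/\alpha') \mu \log p)\, E\{\exp(\xi/\alpha')\} \\
&= (1 - 2/\alpha')^{-\mu/2} \exp(-(\alpha/\alpha') \mu \log p).
\end{aligned}$$

Using this inequality, Bonferroni's inequality, and the fact that when
$\gamma \in S_1$, $e^T (P_\gamma - P_{\gamma^0}) e \sim \sigma_0^2
\chi^2_{|\gamma| - s_n}$, we have

$$\begin{aligned}
& pr\Big( \max_{\gamma \in S_1} \frac{e^T (P_\gamma - P_{\gamma^0}) e}
  {|\gamma| - s_n} > \alpha \sigma_0^2 \log p \Big) \\
&\le \sum_{\gamma \in S_1} pr\big( e^T (P_\gamma - P_{\gamma^0}) e >
  \alpha \sigma_0^2 (|\gamma| - s_n) \log p \big) \\
&\le \sum_{\gamma \in S_1} (1 - 2/\alpha')^{-(|\gamma| - s_n)/2}
  \exp(-(\alpha/\alpha')(|\gamma| - s_n) \log p) \\
&= \sum_{r=1}^{p - s_n} \binom{p - s_n}{r} (1 - 2/\alpha')^{-r/2}
  \exp(-(\alpha/\alpha') r \log p) \\
&= \big( 1 + (1 - 2/\alpha')^{-1/2} p^{-\alpha/\alpha'} \big)^{p - s_n} - 1
  \to 0.
\end{aligned}$$

When $X$ is random and independent of $e$, then conditioning on $X$,
$e^T (P_\gamma - P_{\gamma^0}) e \sim \sigma_0^2 \chi^2_{|\gamma| - s_n}$.
Thus, the conclusion follows from the above arguments.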
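
The exponential-moment bound used in part (b) can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the exact chi-square tail is computed with the closed form that holds for even degrees of freedom, and the particular values of the two constants (3.0 and 2.5) are arbitrary choices satisfying the constraint that the smaller one exceeds 2.

```python
import math


def chi2_sf_even(x, df):
    """Exact P(chi2_df > x) for even df, via the Poisson-sum formula."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def chernoff_bound(alpha, alpha_prime, mu, p):
    """(1 - 2/a')^(-mu/2) * exp(-(a/a') * mu * log p), valid for a' > 2."""
    return ((1.0 - 2.0 / alpha_prime) ** (-mu / 2.0)
            * math.exp(-(alpha / alpha_prime) * mu * math.log(p)))


def union_bound(p, s_n, alpha, alpha_prime):
    """(1 + (1 - 2/a')^(-1/2) * p^(-a/a'))^(p - s_n) - 1."""
    x = (1.0 - 2.0 / alpha_prime) ** -0.5 * p ** (-alpha / alpha_prime)
    return (1.0 + x) ** (p - s_n) - 1.0


if __name__ == "__main__":
    alpha, alpha_prime = 3.0, 2.5
    for mu in (2, 4, 10):
        for p in (5, 50):
            exact = chi2_sf_even(alpha * mu * math.log(p), mu)
            print(mu, p, exact <= chernoff_bound(alpha, alpha_prime, mu, p))
    for p in (10, 10**4, 10**8):
        print(p, union_bound(p, 2, alpha, alpha_prime))
```

The second loop shows the union bound shrinking as $p$ grows, although slowly when the ratio of the two constants is close to one.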

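(c) We let $X$ be deterministic. The case where $X$ is random can be
handled similarly. Assume that $S_2$ contains nonnull models, and note that
when $\gamma$ is nonnull, $e^T P_\gamma e \sim \sigma_0^2 \chi^2_{|\gamma|}$.
Fix arbitrarily $\alpha'$ such that $2 < \alpha' < \alpha$.
Then by the proof of part (b) we have

$$\begin{aligned}
& pr\Big( \max_{\gamma \in S_2} \frac{e^T P_\gamma e}{|\gamma|} >
  \alpha \sigma_0^2 \log p \Big) \\
&= pr\Big( \max_{\gamma \in S_2 \setminus \{\emptyset\}}
  \frac{e^T P_\gamma e}{|\gamma|} > \alpha \sigma_0^2 \log p \Big) \\
&\le \sum_{\gamma \in S_2 \setminus \{\emptyset\}}
  pr\big( e^T P_\gamma e > \alpha \sigma_0^2 |\gamma| \log p \big) \\
&\le \sum_{\gamma \in S_2 \setminus \{\emptyset\}}
  (1 - 2/\alpha')^{-|\gamma|/2} \exp(-(\alpha/\alpha') |\gamma| \log p) \\
&\le \sum_{r=1}^{p} \binom{p}{r} (1 - 2/\alpha')^{-r/2}
  p^{-(\alpha/\alpha') r} \\
&= \big( 1 + (1 - 2/\alpha')^{-1/2} p^{-\alpha/\alpha'} \big)^{p} - 1 \to 0.
\end{aligned}$$ □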
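Proof of Theorem 2.2. We have

$$\begin{aligned}
-\log\frac{p(\gamma|Z)}{p(\gamma^0|Z)}
&= -\log\frac{p(\gamma)}{p(\gamma^0)}
 + \frac{1}{2}\log\frac{\det(W_\gamma)}{\det(W_{\gamma^0})}
 + \frac{n+\nu}{2}\log\frac{1 + y^T(I_n - X_\gamma U_\gamma^{-1} X_\gamma^T)y}
   {1 + y^T(I_n - X_{\gamma^0} U_{\gamma^0}^{-1} X_{\gamma^0}^T)y} \\
&= -\log\frac{p(\gamma)}{p(\gamma^0)}
 + \frac{1}{2}\log\frac{\det(W_\gamma)}{\det(W_{\gamma^0})}
 + \frac{n+\nu}{2}\log\frac{1 + y^T(I_n - X_\gamma U_\gamma^{-1} X_\gamma^T)y}
   {1 + y^T(I_n - P_\gamma)y} \\
&\quad - \frac{n+\nu}{2}\log\frac{1 + y^T(I_n - X_{\gamma^0}
   U_{\gamma^0}^{-1} X_{\gamma^0}^T)y}{1 + y^T(I_n - P_{\gamma^0})y}
 + \frac{n+\nu}{2}\log\frac{1 + y^T(I_n - P_\gamma)y}
   {1 + y^T(I_n - P_{\gamma^0})y}. \qquad (6.1)
\end{aligned}$$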
Denote the above summands by $T_1, T_2, T_3, T_4, T_5$. By Assumption 2.6,
$T_1$ is bounded below. Since $U_\gamma \ge X_\gamma^T X_\gamma$, we have
$T_3 \ge 0$ for any $n$.
To approximate $T_4$, let

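$$A = y^T X_{\gamma^0} (X_{\gamma^0}^T X_{\gamma^0})^{-1}
\big( \Sigma_{\gamma^0} + (X_{\gamma^0}^T X_{\gamma^0})^{-1} \big)^{-1}
(X_{\gamma^0}^T X_{\gamma^0})^{-1} X_{\gamma^0}^T y.$$

By the Sherman-Morrison-Woodbury matrix identity (Seber and Lee, 2003, page
467),

$$y^T (I_n - X_{\gamma^0} U_{\gamma^0}^{-1} X_{\gamma^0}^T) y
= y^T (I_n - P_{\gamma^0}) y + A. \qquad (6.2)$$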
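
As a sketch of how the matrix identity enters, assume $U_{\gamma^0} = \Sigma_{\gamma^0}^{-1} + X_{\gamma^0}^T X_{\gamma^0}$ (this form is inferred from the determinant computations in the proof of Lemma 6.2 and is an assumption of this note), and write $M = X_{\gamma^0}^T X_{\gamma^0}$ and $\Sigma = \Sigma_{\gamma^0}$ for brevity, so that $P_{\gamma^0} = X_{\gamma^0} M^{-1} X_{\gamma^0}^T$. Then:

```latex
(\Sigma^{-1} + M)^{-1}
  = M^{-1} - M^{-1} \big( \Sigma + M^{-1} \big)^{-1} M^{-1},
\qquad\text{so}\qquad
y^T X_{\gamma^0} U_{\gamma^0}^{-1} X_{\gamma^0}^T y
  = y^T P_{\gamma^0} y - A .
```

Since $(\Sigma + M^{-1})^{-1}$ is positive definite, $A \ge 0$, which is why replacing $U_{\gamma^0}^{-1}$ by the projection can only decrease the residual quadratic form.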

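By (6.2) and the fact that
$\big( \Sigma_{\gamma^0} + (X_{\gamma^0}^T X_{\gamma^0})^{-1} \big)^{-1}
\le \Sigma_{\gamma^0}^{-1}$, we have

$$\begin{aligned}
\frac{1 + y^T(I_n - X_{\gamma^0} U_{\gamma^0}^{-1} X_{\gamma^0}^T)y}
     {1 + y^T(I_n - P_{\gamma^0})y}
&= 1 + \frac{A}{1 + y^T(I_n - P_{\gamma^0})y} \\
&\le 1 + 2\, \frac{(\beta^0_{\gamma^0})^T \Sigma_{\gamma^0}^{-1}
   \beta^0_{\gamma^0} + e^T X_{\gamma^0} (X_{\gamma^0}^T X_{\gamma^0})^{-1}
   \Sigma_{\gamma^0}^{-1} (X_{\gamma^0}^T X_{\gamma^0})^{-1}
   X_{\gamma^0}^T e}{1 + y^T(I_n - P_{\gamma^0})y} \\
&\le 1 + 2\underline{\phi}_n^{-1}\, \frac{\|\beta^0_{\gamma^0}\|_2^2
   + e^T X_{\gamma^0} (X_{\gamma^0}^T X_{\gamma^0})^{-2} X_{\gamma^0}^T e}
   {1 + y^T(I_n - P_{\gamma^0})y}.
\end{aligned}$$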

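Since $y^T(I_n - P_{\gamma^0})y/n = e^T(I_n - P_{\gamma^0})e/n \to_p
\sigma_0^2$ and $E\{e^T X_{\gamma^0} (X_{\gamma^0}^T X_{\gamma^0})^{-2}
X_{\gamma^0}^T e\} \le \sigma_0^2 s_n (n \varphi_{\min}(n))^{-1}$, we have
$e^T X_{\gamma^0} (X_{\gamma^0}^T X_{\gamma^0})^{-2} X_{\gamma^0}^T e
= O_p(s_n (n \varphi_{\min}(n))^{-1})$.
Therefore, by Assumptions 2.2 and 2.3 and the fact that
$k_n \ge s_n \psi_n$, we can show that

$$\frac{1 + y^T(I_n - X_{\gamma^0} U_{\gamma^0}^{-1} X_{\gamma^0}^T)y}
       {1 + y^T(I_n - P_{\gamma^0})y}
\le 1 + \frac{2 k_n}{n \underline{\phi}_n \sigma_0^2}(1 + o_p(1)).
\qquad (6.3)$$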



Consequently, $0 \le -T_4 = O_p(1)$ follows from the condition that
$k_n = O(\underline{\phi}_n)$ (Assumption 2.7).

Next we approximate $T_2$ and $T_5$ in the following Lemmas 6.2 and 6.3.



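Lemma 6.2. Under Assumption 2.8, if $\gamma \in S_1$, then uniformly for
$c_j$'s $\in [\underline{\phi}_n, \bar{\phi}_n]$,
$T_2 \ge 2^{-1}(|\gamma| - s_n) \log(1 + C_3 n^{1-\delta} \underline{\phi}_n)$.
Under Assumption 2.2, if $\gamma \in S_2$, then uniformly for $c_j$'s
$\in [\underline{\phi}_n, \bar{\phi}_n]$,
$T_2 \ge -2^{-1} s_n \log(1 + C_2 n \bar{\phi}_n)$, where
$C_2$ and $C_3$ are constants given in Assumptions 2.2 and 2.8 respectively.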

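Proof of Lemma 6.2. If $\gamma \in S_1$, it follows from the determinant
formula for block matrices (Seber and Lee, 2003, page 468), and Assumption
2.8 that

$$\begin{aligned}
\det(U_\gamma)
&= \det(U_{\gamma^0}) \det\big( \Sigma^{-1}_{\gamma \setminus \gamma^0}
  + X_{\gamma \setminus \gamma^0}^T (I_n - X_{\gamma^0} U_{\gamma^0}^{-1}
  X_{\gamma^0}^T) X_{\gamma \setminus \gamma^0} \big) \\
&\ge \det(U_{\gamma^0}) \det\big( \Sigma^{-1}_{\gamma \setminus \gamma^0}
  + X_{\gamma \setminus \gamma^0}^T (I_n - P_{\gamma^0})
  X_{\gamma \setminus \gamma^0} \big) \\
&\ge \det(U_{\gamma^0}) \det\big( \Sigma^{-1}_{\gamma \setminus \gamma^0}
  + C_3 n^{1-\delta} I_{|\gamma \setminus \gamma^0|} \big).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\frac{\det(W_\gamma)}{\det(W_{\gamma^0})}
&= \frac{\det(\Sigma_\gamma)}{\det(\Sigma_{\gamma^0})}
   \frac{\det(U_\gamma)}{\det(U_{\gamma^0})} \\
&\ge \det(\Sigma_{\gamma \setminus \gamma^0})
   \det\big( \Sigma^{-1}_{\gamma \setminus \gamma^0}
   + C_3 n^{1-\delta} I_{|\gamma \setminus \gamma^0|} \big) \\
&= \det\big( I_{|\gamma \setminus \gamma^0|}
   + C_3 n^{1-\delta} \Sigma_{\gamma \setminus \gamma^0} \big) \\
&\ge \det\big( (1 + C_3 n^{1-\delta} \underline{\phi}_n)
   I_{|\gamma \setminus \gamma^0|} \big)
 = (1 + C_3 n^{1-\delta} \underline{\phi}_n)^{|\gamma| - s_n}, \qquad (6.4)
\end{aligned}$$

which shows that $T_2 \ge 2^{-1}(|\gamma| - s_n) \log(1 + C_3 n^{1-\delta}
\underline{\phi}_n)$. If $\gamma \in S_2$, note that
$\det(W_\gamma) \ge 1$, and by Assumption 2.2,

$$T_2 \ge -\frac{1}{2}\log(\det(W_{\gamma^0}))
\ge -\frac{1}{2}\log(\det(I_{s_n} + C_2 n \Sigma_{\gamma^0}))
\ge -2^{-1} s_n \log(1 + C_2 n \bar{\phi}_n),$$

which completes the proof of Lemma 6.2. □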

Lemma 6.3. Let $\alpha_0 > 2$. If either Assumption 2.4 or 2.5 is satisfied,
when $n$ is large, with large probability and uniformly for $\gamma \in S_1$,
$T_5 \ge -2^{-1}(|\gamma| - s_n) \alpha_0 \log p$. If both Assumptions 2.2 and
2.4 are satisfied, there exists a constant $C'$ such that when $n$ is large,
with large probability and uniformly for $\gamma \in S_2$,
$T_5 \ge 2^{-1}(n + \nu) \log(1 + C' \psi_n)$.

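Proof of Lemma 6.3. We consider $\gamma \in S_1$ and $\gamma \in S_2$
separately. Notice that Assumption 2.4 implies that
$p \log p = o(n \log(1 + \psi_n))$, and therefore implies that
$p \log p = o(n \psi_n)$. Let $v_\gamma = (I_n - P_\gamma)
X_{\gamma^0 \setminus \gamma} \beta^0_{\gamma^0 \setminus \gamma}$. From Lemma
6.1 (a) and (c), there exists $C > 0$ such that when $n$ is sufficiently large,
with large probability, for any $\gamma \in S_2$,

$$\begin{aligned}
y^T(I_n - P_\gamma)y
&\ge \|v_\gamma\|_2^2 - 2C\sqrt{p}\,\|v_\gamma\|_2 + e^T e - C p \log p \\
&= \|v_\gamma\|_2^2 \Big( 1 - \frac{2C\sqrt{p}}{\|v_\gamma\|_2}
   - \frac{C p \log p}{\|v_\gamma\|_2^2} \Big) + e^T e \\
&= \|v_\gamma\|_2^2 (1 + o(1)) + e^T e \\
&\ge n \varphi_{\min}(n) \|\beta^0_{\gamma^0 \setminus \gamma}\|_2^2
   (1 + o(1)) + e^T e \\
&\ge n \varphi_{\min}(n) \psi_n (1 + o(1)) + e^T e. \qquad (6.5)
\end{aligned}$$

It is easy to see that Assumption 2.4 implies that $s_n = o(n)$, and therefore,
$e^T(I_n - P_{\gamma^0})e = n \sigma_0^2 (1 + o_p(1))$. Thus, by (6.5), there
exists a $C'$ such that for sufficiently large $n$, with large probability,
uniformly for $\gamma \in S_2$,

$$T_5 \ge \frac{n+\nu}{2} \log\Big( \frac{1 + n \varphi_{\min}(n) \psi_n
(1 + o(1)) + e^T e}{1 + e^T(I_n - P_{\gamma^0})e} \Big)
\ge \frac{n+\nu}{2} \log(1 + C' \psi_n). \qquad (6.6)$$

On the other hand, by properties of projection matrices and Lemma 6.1
(b), when $n$ is sufficiently large, with large probability, we have uniformly
for $\gamma \in S_1$,

$$\begin{aligned}
\frac{1 + y^T(I_n - P_\gamma)y}{1 + y^T(I_n - P_{\gamma^0})y}
&= 1 - \frac{y^T(P_\gamma - P_{\gamma^0})y}{1 + y^T(I_n - P_{\gamma^0})y} \\
&= 1 - \frac{(\beta^0_{\gamma^0})^T X_{\gamma^0}^T (P_\gamma - P_{\gamma^0})
   X_{\gamma^0} \beta^0_{\gamma^0} + 2 (\beta^0_{\gamma^0})^T X_{\gamma^0}^T
   (P_\gamma - P_{\gamma^0}) e + e^T(P_\gamma - P_{\gamma^0})e}
   {1 + y^T(I_n - P_{\gamma^0})y} \\
&= 1 - \frac{e^T(P_\gamma - P_{\gamma^0})e}{1 + e^T(I_n - P_{\gamma^0})e} \\
&\ge 1 - \frac{\alpha (|\gamma| - s_n) \log p}{n},
\end{aligned}$$

where we have temporarily fixed an $\alpha$ such that
$2 < \alpha < \sqrt{2\alpha_0}$. It follows by the inequality that
$\log(1 - x) \ge -(\alpha/2)x$ when $x \in (0, 1 - 2/\alpha)$, and by
Assumption 2.4 or 2.5 (which both imply that $(|\gamma| - s_n) \log p / n$
approaches zero uniformly for $\gamma \in S_1$) that for sufficiently large $n$,
with large probability and uniformly for $\gamma \in S_1$,

$$T_5 \ge \frac{n+\nu}{2} \log\Big( 1 - \frac{\alpha (|\gamma| - s_n)
\log p}{n} \Big) \ge -\frac{1}{2} (|\gamma| - s_n) \alpha_0 \log p,
\qquad (6.7)$$

which completes the proof of Lemma 6.3. □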
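
The two elementary facts behind the last step of the proof above can be checked numerically. The sketch below is an illustration only: the function names and the numerical values are hypothetical, and the constant returned by `t5_constant` is the factor that results from combining the logarithm inequality with the prefactor (n + nu)/2, which must stay below the target constant (2.5 in the example) for large n.

```python
import math


def log_lower_bound_holds(alpha, grid=1000):
    """Check log(1 - x) >= -(alpha / 2) * x on a grid in (0, 1 - 2/alpha)."""
    upper = 1.0 - 2.0 / alpha
    for i in range(1, grid):
        x = upper * i / grid
        if math.log(1.0 - x) < -(alpha / 2.0) * x:
            return False
    return True


def t5_constant(alpha, n, nu):
    """Factor c in T5 >= -(c / 2) * (|gamma| - s_n) * log p after chaining.

    Chaining gives c = (alpha^2 / 2) * (n + nu) / n, which tends to
    alpha^2 / 2 as n grows.
    """
    return (alpha ** 2 / 2.0) * (n + nu) / n


if __name__ == "__main__":
    alpha_0 = 2.5                      # hypothetical target constant
    alpha = 2.2                        # satisfies 2 < alpha < sqrt(2 * alpha_0)
    print(log_lower_bound_holds(alpha))
    print(t5_constant(alpha, n=10**6, nu=1) <= alpha_0)
```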
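Now we are ready to finish the proof of Theorem 2.2. By (6.3), Lemma 6.2,
Lemma 6.3, Assumption 2.4 and the fact that $p^{\alpha_0} = o(\rho_n)$ with
$\rho_n = n^{1-\delta} \underline{\phi}_n$, with large probability, uniformly
for $\gamma \in S_1$ and $c_1, \ldots, c_p \in [\underline{\phi}_n,
\bar{\phi}_n]$,

$$\begin{aligned}
p(\gamma|Z)/p(\gamma^0|Z)
&\le C \exp\Big( -2^{-1}(|\gamma| - s_n)
   \log\big( (1 + C_3 \rho_n)/p^{\alpha_0} \big) \Big) \\
&= C \Big( \frac{1 + C_3 \rho_n}{p^{\alpha_0}} \Big)^{-2^{-1}(|\gamma| - s_n)}
   \to 0. \qquad (6.8)
\end{aligned}$$

By Assumptions 2.4 and 2.6, it can be verified that
$s_n \log(1 + C_2 n \bar{\phi}_n) \le \frac{n+\nu}{2} \log(1 + C' \psi_n)$.
So, with large probability, uniformly for $\gamma \in S_2$ and
$c_1, \ldots, c_p \in [\underline{\phi}_n, \bar{\phi}_n]$,

$$\begin{aligned}
p(\gamma|Z)/p(\gamma^0|Z)
&\le C \exp\Big( 2^{-1} s_n \log(1 + C_2 n \bar{\phi}_n)
   - \frac{n+\nu}{2} \log(1 + C' \psi_n) \Big) \\
&\le C (1 + C' \psi_n)^{-(n+\nu)/4} \to 0, \qquad (6.9)
\end{aligned}$$

where $C$ in (6.8) and (6.9) depends on the lower bounds of $T_1$ and $T_4$.
For the proof of PMC, we consider two cases. It is easy to see from (6.8) that

$$\begin{aligned}
\sum_{\gamma \in S_1} p(\gamma|Z)/p(\gamma^0|Z)
&\le C \sum_{\gamma \in S_1} \Big( \frac{1 + C_3 \rho_n}{p^{\alpha_0}}
   \Big)^{-2^{-1}(|\gamma| - s_n)} \\
&= C \sum_{r=1}^{p - s_n} \binom{p - s_n}{r}
   \Big( \frac{1 + C_3 \rho_n}{p^{\alpha_0}} \Big)^{-r/2} \\
&= C \Big[ \Big( 1 + \Big( \frac{1 + C_3 \rho_n}{p^{\alpha_0}} \Big)^{-1/2}
   \Big)^{p - s_n} - 1 \Big] \to 0,
\end{aligned}$$

where the last limit result follows from the assumption that
$p^{\alpha_0 + 2} = o(\rho_n)$.

Similarly, by (6.9), and $p \log n = o(n \log(1 + \psi_n))$ (which follows from
Assumption 2.4), we can show that

$$\sum_{\gamma \in S_2} p(\gamma|Z)/p(\gamma^0|Z)
\le C 2^p (1 + C' \psi_n)^{-(n+\nu)/4} \to 0. \qquad (6.10)$$

This completes the proof of Theorem 2.2. □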
Proof of Theorem 2.4. The assumption that $\gamma^0$ is null implies that
the model class $S_2$ is empty. Similar to the proof of Theorem 2.2, we need
to approximate $T_1$ to $T_5$ in (6.1). This is easier when the true model is
null since $T_4 = 0$, and by Lemma 6.2, when $\gamma$ is nonnull,
$T_2 \ge 2^{-1} |\gamma| \log(1 + C_3 n^{1-\delta} \underline{\phi}_n)$. Since
$T_1$ and $T_3$ are still bounded below, the proof reduces to approximating
$T_5$. By Lemma 6.3, Assumption 2.5, and the fact that $s_n = 0$, when $n$ is
large, with large probability and uniformly for $\gamma \in S_1$,
$T_5 \ge -2^{-1} |\gamma| \alpha_0 \log p$.
Therefore, the remaining proofs can be finished by arguments similar to (6.8)
and (6.10). □
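Proof of Theorem 3.1. (i) Let $p(\gamma|Z, c)$ be the posterior probability of
$\gamma$ given $Z$ and $c$, as specified by (2.3). Applying Theorem 2.2, we have
that in probability

$$\sup_{c \in [\underline{\phi}, \bar{\phi}]} \max_{\gamma \ne \gamma^0}
p(\gamma|Z, c)/p(\gamma^0|Z, c) \to 0.$$

Then the result follows from $p(\gamma|Z) = \int p(\gamma|Z, c) g(c) dc$, and

$$\frac{\int p(\gamma|Z, c) g(c) dc}{\int p(\gamma^0|Z, c) g(c) dc}
\le \sup_{c \in [\underline{\phi}, \bar{\phi}]} \max_{\gamma \ne \gamma^0}
p(\gamma|Z, c)/p(\gamma^0|Z, c).$$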
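
The inequality in part (i) can be verified in one line. The shorthand $R(c)$ for the largest posterior odds ratio at a given $c$ is introduced here for illustration, and the step assumes, as the supremum in the display suggests, that $g$ is supported on the interval over which the supremum is taken.

```latex
\int p(\gamma \mid Z, c)\, g(c)\, dc
  = \int \frac{p(\gamma \mid Z, c)}{p(\gamma^0 \mid Z, c)}\,
         p(\gamma^0 \mid Z, c)\, g(c)\, dc
  \le \Big( \sup_{c \in [\underline{\phi}, \bar{\phi}]} R(c) \Big)
      \int p(\gamma^0 \mid Z, c)\, g(c)\, dc,
\qquad
R(c) = \max_{\gamma \neq \gamma^0}
       \frac{p(\gamma \mid Z, c)}{p(\gamma^0 \mid Z, c)} .
```

Dividing both sides by the integral on the right gives the stated bound.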

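(ii) Let $0 < \underline{\phi} < \bar{\phi}$. By Theorem 2.2,
$\inf_{c \in [\underline{\phi}, \bar{\phi}]} p(\gamma^0|Z, c) \to 1$ in
probability. Since

$$p(\gamma^0|Z) = \int_{\underline{\phi}}^{\bar{\phi}}
(p(\gamma^0|Z, c) - 1) g(c) dc
+ \int_{\underline{\phi}}^{\bar{\phi}} g(c) dc
+ \int_{[0, \infty) \setminus [\underline{\phi}, \bar{\phi}]}
p(\gamma^0|Z, c) g(c) dc,$$

the result follows by fixing $\underline{\phi}$ and $\bar{\phi}$ so that
$\int_{\underline{\phi}}^{\bar{\phi}} g(c) dc$ is close to 1, and letting
$n$ go to $\infty$.

Proof of Theorem 3.2. The proof is similar to that of Theorem 3.1.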



Supplementary Materials

Supplements A-C are given on the authors' website:

http://www.stat.wisc.edu/~shang/

Supplement A: Generalizations of Bayesian consistency to ultra-high
dimensional settings.

Supplement B: Proof of Corollaries 2.5 and 2.6.

Supplement C: Almost Sure Consistency of $p(\gamma^0|Z)$.